Red Wine Data Analysis by Sourabh Dev

Let's start by looking at the data summary.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

From this summary we can group the variables into some broad categories: acidity, sugar, chemical compounds (chlorides, sulfur dioxide, sulphates), alcohol content and quality.
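The summary above is the kind of output `str()` and `summary()` produce. A minimal sketch of that step follows; the file name `wineQualityReds.csv` and the helper `load_wine()` are assumptions for illustration, since the report does not show how the data was read. Only the data frame name `analysis_winedata` comes from the model output later in the report.

```r
# Assumption: this file name is hypothetical; the report never names its input file.
wine_path <- "wineQualityReds.csv"

# Hypothetical helper: read the CSV and check that the target column is present.
load_wine <- function(path) {
  df <- read.csv(path)
  stopifnot(is.data.frame(df), "quality" %in% names(df))
  df
}

if (file.exists(wine_path)) {
  analysis_winedata <- load_wine(wine_path)
  str(analysis_winedata)             # column types and first few values
  print(summary(analysis_winedata))  # min, quartiles, mean and max per column
}
```

The `file.exists()` guard simply keeps the sketch from failing when the file is stored elsewhere.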

Univariate Plots Section

Let's start by plotting the quality.

This looks roughly like a normal distribution, centred on quality scores of 5 and 6.
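The shape of the quality distribution can be checked from the counts alone. The sketch below reuses the group sizes printed later in this report (the `n` column of the alcohol-by-quality summary) instead of the raw file. The `barplot()` call is base R and is only an assumption about how such a plot might be drawn, as the original plotting code is not shown.

```r
# Group sizes per quality score, copied from the n column of the
# alcohol-by-quality summary in the Bivariate Plots Section.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total_wines  <- sum(quality_counts)  # should match the 1599 observations
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / total_wines

cat("wines:", total_wines, " share rated 5 or 6:", round(share_5_or_6, 3), "\n")

# Base R bar chart (assumed plotting approach); drawn only in an interactive session.
if (interactive()) {
  barplot(quality_counts, xlab = "quality", ylab = "number of wines")
}
```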

To continue this analysis, let's look at density, alcohol levels and sugar.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The density looks like a normal distribution and the alcohol data is a little skewed. We can see a large spike in the alcohol level around 9.5%.

Sugar seems to be drastically skewed, so it makes sense to look at it on a log scale.

Nothing significant can be seen here.
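To make "skewed" concrete, here is a small sketch that compares the skewness of residual sugar before and after a log10 transform. It uses only the first ten values shown in the `str()` output, so the numbers are illustrative and will differ from the full data. The `skewness()` helper is written here for the example and is not a function used in the report.

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))

cat("skewness raw:", round(skew_raw, 2), " log10:", round(skew_log, 2), "\n")
```

A positive value indicates a long right tail; the log transform pulls in large values such as 6.1 and reduces that tail.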

Now, let's look at the acidity.

pH seems to follow a normal distribution, with the largest concentration around 3.3.

The fixed and volatile acidity both seem to be skewed, but no pattern is visible in the citric acid levels. So, let's explore citric acid further.

It seems skewed when measured on a log scale.
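One caveat when putting citric acid on a log scale: the summary shows a minimum of 0.000, and log10(0) is not finite, so wines with no citric acid cannot be placed on a log axis. The sketch below counts such values among the first ten rows from the `str()` output. The `log1p()` alternative is a suggestion for illustration, not something the report used.

```r
# First ten citric.acid values, copied from the str() output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

log_citric <- log10(citric_acid)          # log10(0) evaluates to -Inf
n_dropped  <- sum(!is.finite(log_citric)) # values a log-scaled axis cannot show

# Possible workaround (an assumption, not the report's method):
# log1p(x) = log(1 + x) maps 0 to 0 and keeps every observation.
log1p_citric <- log1p(citric_acid)

cat("values lost on a log10 axis:", n_dropped, "of", length(citric_acid), "\n")
```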

Finally, let's explore the chemical levels.

These plots look like normal distributions if we remove the outliers.

Both distributions are skewed.

Univariate Analysis

What is the structure of your dataset?

There are 1599 different wines and the dataset has 13 variables: a row index (“X”) and 12 features (“fixed.acidity”,“volatile.acidity”,“citric.acid”,“residual.sugar”,“chlorides”,“free.sulfur.dioxide”,“total.sulfur.dioxide”,“density”,“pH”,“sulphates”,“alcohol”,“quality”).

Some interesting observations:

* The majority of the wines are rated 5 or 6 in quality.
* The alcohol levels are skewed, with a large spike at 9.5%.
* The median pH value is 3.31.

What is/are the main feature(s) of interest in your dataset?

The main feature in this dataset is the quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The main supporting features of interest are citric.acid, residual.sugar, pH and alcohol. It would be interesting to see how these variables affect the quality.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Citric acid and alcohol seem to be a little unusual. Alcohol has a skewed distribution with a sudden dip; it looks almost bimodal. Citric acid is skewed when the x axis is on a log scale.

No additional changes were made.

Bivariate Plots Section

##                  fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity       1.00000000     -0.256130895   0.6717034    0.114776724
## volatile.acidity   -0.25613089      1.000000000  -0.5524957    0.001917882
## citric.acid         0.67170343     -0.552495685   1.0000000    0.143577162
## residual.sugar      0.11477672      0.001917882   0.1435772    1.000000000
## density             0.66804729      0.022026232   0.3649472    0.355283371
## pH                 -0.68297819      0.234937294  -0.5419041   -0.085652422
## alcohol            -0.06166827     -0.202288027   0.1099032    0.042075437
## quality             0.12405165     -0.390557780   0.2263725    0.013731637
##                      density          pH     alcohol     quality
## fixed.acidity     0.66804729 -0.68297819 -0.06166827  0.12405165
## volatile.acidity  0.02202623  0.23493729 -0.20228803 -0.39055778
## citric.acid       0.36494718 -0.54190414  0.10990325  0.22637251
## residual.sugar    0.35528337 -0.08565242  0.04207544  0.01373164
## density           1.00000000 -0.34169933 -0.49617977 -0.17491923
## pH               -0.34169933  1.00000000  0.20563251 -0.05773139
## alcohol          -0.49617977  0.20563251  1.00000000  0.47616632
## quality          -0.17491923 -0.05773139  0.47616632  1.00000000

Let's draw a correlation plot to get a better understanding.

From the above table and plot matrix we see that “citric.acid” is correlated with “fixed.acidity” (0.67), “volatile.acidity” (-0.55) and “pH” (-0.54). Interestingly, “density” is correlated with “fixed.acidity” (0.67) and “alcohol” (-0.50). Also, “quality” has some correlation with “alcohol” (0.48).
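A correlation table like the one above can be produced with base R's `cor()`. The sketch below runs it on the first ten rows of four columns, copied from the `str()` output, so its values will not match the full-data table. The name `wine_head` is introduced here for illustration only.

```r
# First ten rows of four columns, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pearson correlation matrix (Pearson is the default method of cor()).
cor_matrix <- cor(wine_head)
print(round(cor_matrix, 2))
```

On the full data the same call would take the relevant columns of `analysis_winedata`; density is left out here only because the `str()` output prints fewer of its values.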

Let's now look at pH, fixed.acidity and volatile.acidity versus citric.acid.

## 
## Call:
## lm(formula = citric.acid ~ pH, data = analysis_winedata)
## 
## Coefficients:
## (Intercept)           pH  
##      2.5350      -0.6838

From the scatter plot we can see that citric acid and pH are negatively correlated (r ≈ -0.54 in the table above).
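The fitted slope and the correlation coefficient are two views of the same relationship: for a single predictor, the least-squares slope equals r · sd(y) / sd(x). The sketch below checks this on the first ten rows from the `str()` output, using the same formula as the model above. The coefficients on this small sample will differ from the full-data fit.

```r
# First ten pH and citric.acid values, copied from the str() output above.
wine_head <- data.frame(
  pH          = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# Same formula as in the report, fitted on the small sample.
fit   <- lm(citric.acid ~ pH, data = wine_head)
slope <- unname(coef(fit)["pH"])

# For one predictor, the least-squares slope equals r * sd(y) / sd(x).
r            <- cor(wine_head$pH, wine_head$citric.acid)
slope_from_r <- r * sd(wine_head$citric.acid) / sd(wine_head$pH)

cat("slope from lm():", slope, " slope from r:", slope_from_r, "\n")
```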

## 
## Call:
## lm(formula = citric.acid ~ fixed.acidity, data = analysis_winedata)
## 
## Coefficients:
##   (Intercept)  fixed.acidity  
##      -0.35427        0.07515

From the scatter plot we can see that citric acid and fixed acidity are positively correlated (r ≈ 0.67 in the table above).

## 
## Call:
## lm(formula = citric.acid ~ volatile.acidity, data = analysis_winedata)
## 
## Coefficients:
##      (Intercept)  volatile.acidity  
##           0.5882           -0.6011

This data looks very similar to pH vs citric acid levels. Maybe pH and volatile.acidity have some relationship. Let’s try to plot it.

## 
## Call:
## lm(formula = pH ~ volatile.acidity, data = analysis_winedata)
## 
## Coefficients:
##      (Intercept)  volatile.acidity  
##           3.2042            0.2026

There does seem to be a positive correlation here, although a weak one (r ≈ 0.23 in the table above).

Now, let's look at density vs alcohol and density vs fixed.acidity.

## 
## Call:
## lm(formula = density ~ alcohol, data = analysis_winedata)
## 
## Coefficients:
## (Intercept)      alcohol  
##   1.0059059   -0.0008788

The general trend is that density decreases as alcohol increases. This makes sense: alcohol is lighter than water, so more alcohol means less water and hence a lower density.

There is a clear positive linear relationship between fixed acidity and density (r ≈ 0.67): density goes up with fixed acidity.

Now, let's move to the most interesting plot, between alcohol and quality.
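To get a feel for the size of this effect, the fitted coefficients can be plugged in by hand. The sketch below uses the intercept and slope from the `lm(density ~ alcohol)` output and the minimum and maximum alcohol values from the data summary. The helper name `predict_density()` is introduced here for illustration.

```r
# Coefficients copied from the lm(density ~ alcohol) output above.
intercept <- 1.0059059
slope     <- -0.0008788

# Hypothetical helper: fitted density for a given alcohol level.
predict_density <- function(alcohol) intercept + slope * alcohol

# Minimum and maximum alcohol values from the data summary.
alcohol_range  <- c(min = 8.4, max = 14.9)
fitted_density <- predict_density(alcohol_range)
print(round(fitted_density, 4))

# Change in fitted density across the whole observed alcohol range.
density_drop <- unname(fitted_density["min"] - fitted_density["max"])
cat("fitted density drop:", round(density_drop, 4), "\n")
```

The fitted drop is 6.5 × 0.0008788 ≈ 0.0057, compared with an observed density range of 1.0037 − 0.9901 = 0.0136, so alcohol accounts for a good part of the spread in density but not all of it.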

## $`3`
##    vars  n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 10 9.96 0.82   9.93   10.02 0.78 8.4  11   2.6 -0.41    -0.99 0.26
## 
## $`4`
##    vars  n  mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 53 10.27 0.93     10   10.21 1.19   9 13.1   4.1 0.61    -0.23
##      se
## X1 0.13
## 
## $`5`
##    vars   n mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 681  9.9 0.74    9.7    9.79 0.44 8.5 14.9   6.4 1.83     5.25
##      se
## X1 0.03
## 
## $`6`
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 638 10.63 1.05   10.5   10.56 1.19 8.4  14   5.6 0.54    -0.16
##      se
## X1 0.04
## 
## $`7`
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 199 11.47 0.96   11.5   11.47 1.04 9.2  14   4.8 0.01    -0.47
##      se
## X1 0.07
## 
## $`8`
##    vars  n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 18 12.09 1.22  12.15   12.12 1.19 9.8  14   4.2 -0.2    -0.98 0.29
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)

There seems to be a positive correlation: mean alcohol rises with quality, except for wines rated 5, whose mean (9.9) is below that of wines rated 4 (10.27).
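The call line in the output above suggests a grouped `describe()` (as provided by the psych package), though the exact call is not shown. A base R sketch of the same idea, using `aggregate()` on the first ten rows from the `str()` output, could look like this. The sample contains only qualities 5, 6 and 7, so it illustrates the mechanics, not the full result.

```r
# First ten alcohol and quality values, copied from the str() output above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Per-group count, mean and median of alcohol, one row per quality score.
alcohol_by_quality <- aggregate(
  alcohol ~ quality,
  data = wine_head,
  FUN  = function(x) c(n = length(x), mean = mean(x), median = median(x))
)
print(alcohol_by_quality)
```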

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Most of the comparisons made with citric acid showed some type of linear relationship.

The comparison between alcohol and density supported the hypothesis that wines with low alcohol levels contain a higher proportion of water and are therefore denser, since water is denser than alcohol.

Finally, quality and alcohol showed an increasing, roughly linear relationship. But there is a sudden dip for wines of quality ‘5’.
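The dip can be read directly off the group means printed in the Bivariate Plots Section. The sketch below copies those means and group sizes, computes the step-to-step change, and checks that the size-weighted mean reproduces the overall mean alcohol from the data summary.

```r
# Mean alcohol and group size per quality score, copied from the
# alcohol-by-quality summary in the Bivariate Plots Section.
quality_score <- 3:8
mean_alcohol  <- c(9.96, 10.27, 9.90, 10.63, 11.47, 12.09)
group_size    <- c(10, 53, 681, 638, 199, 18)

# Step-to-step change in mean alcohol; a negative entry marks a dip.
step_change <- diff(mean_alcohol)
names(step_change) <- paste(head(quality_score, -1), "->", tail(quality_score, -1))
print(round(step_change, 2))

# Sanity check: the size-weighted mean should match the overall mean alcohol.
overall_mean <- weighted.mean(mean_alcohol, group_size)
cat("overall mean alcohol:", round(overall_mean, 2), "\n")
```

Only the step from quality 4 to quality 5 is negative. The weighted mean comes out at about 10.42, in line with the summary, which suggests the copied figures are consistent.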

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

As mentioned above, the dip in quality vs alcohol is very interesting.

What was the strongest relationship you found?

pH and fixed acidity have the strongest correlation (r ≈ -0.68).
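This can be verified from the correlation table by ranking the entries on absolute value. The sketch below copies the seven largest-magnitude off-diagonal correlations from the table and also squares them, since r² gives the share of variance one variable explains in the other under a linear fit. The pair labels are written here for readability.

```r
# Largest-magnitude off-diagonal entries, copied from the correlation table above.
pairwise_r <- c(
  "fixed.acidity ~ pH"             = -0.68297819,
  "fixed.acidity ~ citric.acid"    =  0.67170343,
  "fixed.acidity ~ density"        =  0.66804729,
  "volatile.acidity ~ citric.acid" = -0.552495685,
  "citric.acid ~ pH"               = -0.54190414,
  "density ~ alcohol"              = -0.49617977,
  "alcohol ~ quality"              =  0.47616632
)

# Rank by strength regardless of sign.
ranked    <- pairwise_r[order(abs(pairwise_r), decreasing = TRUE)]
strongest <- names(ranked)[1]
r_squared <- ranked^2

print(round(cbind(r = ranked, r_squared = r_squared), 3))
cat("strongest pair:", strongest, "\n")
```

The top three pairs are very close in strength, so "strongest" here is a narrow lead for pH and fixed acidity.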

Multivariate Plots Section

In the above plot of alcohol vs density vs quality, we can see that wines rated 5 in quality tend to be denser while having a lower alcohol content.

No significant observations can be derived from this plot.

There are no interesting patterns here.

Clearly acidity varies negatively with the pH. But, the quality seems to be uniform.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From the first graph, density seems to have an inverse relationship with quality: the denser the wine, the lower its score. The correlation is weak, though (r ≈ -0.17).

Were there any interesting or surprising interactions between features?

No.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The above three graphs show how the different acidity levels are distributed throughout the dataset.

Both fixed and volatile acidity have roughly normal distributions, as expected, although both are somewhat skewed, as noted in the univariate section.

There seem to be spikes in the citric acid levels instead of the expected normal distribution.

Plot Two

Description Two

This is a boxplot of alcohol content grouped by wine quality level. The general expectation was to see a linear relationship between the two variables, and that seems to be the general trend.

But, there seems to be a dip at quality ‘5’.
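A boxplot of this kind can be drawn with base R's `boxplot()` and a formula. The sketch below is an assumption about the approach, since the report's plotting code is not shown, and it uses only the first ten rows from the `str()` output. The function name `plot_alcohol_by_quality()` is introduced here for illustration.

```r
# First ten alcohol and quality values, copied from the str() output above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Hypothetical helper: one box of alcohol values per quality score.
# boxplot() returns the box statistics, which are passed back to the caller.
plot_alcohol_by_quality <- function(df) {
  boxplot(alcohol ~ quality, data = df,
          xlab = "quality", ylab = "alcohol (%)")
}

# Drawn only in an interactive session.
if (interactive()) {
  box_stats <- plot_alcohol_by_quality(wine_head)
  print(box_stats$n)  # number of wines in each quality group
}
```

On the full data, the median line of each box makes the dip at quality 5 easy to spot.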

Plot Three

Description Three

This is a Multivariate plot showing the relationship between Alcohol, Sugar and Quality.

We can see that, even though alcohol levels vary widely with sugar, there is no clear preference for wines with a lower amount of residual sugar. The sugar levels are spread all over the graph.

Reflection

This analysis was conducted with the aim of uncovering hidden insights by moving one step at a time, then proceeding further or backtracking based on the outcome. At times it was surprising when a hypothesis turned out to be incorrect, but the results did make sense. What most influenced the direction of the analysis were the patterns that emerged along the way.

In future work, it would make sense to carry out an analysis based on the chemical composition.

The takeaways from this analysis are that high-quality wines tend to have a higher alcohol content, while residual sugar shows almost no relationship with quality (r ≈ 0.01). Another interesting finding was that citric acid content decreases as pH increases. So, more acidic wines (lower pH) have a higher citric acid content.

In conclusion, if you are looking for a good bottle of red wine, it will most likely be on the stronger side; how sweet it is tells you little about its quality.